Project: No-show appointments Data Analysis

Introduction

In this project, we analyse the No-show appointments data set, a collection of medical appointments from Brazil in the year 2016. Our main focus is to determine whether or not patients show up for their appointments and to identify which factors help predict whether a patient will show up. The data set consists of $14$ columns that describe the characteristics of the patients.

The following questions guide our exploration.

Questions

  1. Is there any relationship between a patient's gender and no-show?
  2. Do patients on Scholarship show up more than patients not on Scholarship?
  3. Is there any relationship between a patient's age and no-show?
  4. Does getting an SMS reminder have any impact on show up rate?
  5. Which neighbourhood has the highest number of patients who don't show up for appointments?
  6. Is there any relationship between the patient's condition or disease and no-show?

We begin by importing the packages required for our analysis. After loading the data set, we assess it and build intuition about it: we check its shape, the data types of the columns, a concise summary of the dataframe using df.info(), and the descriptive statistics for each column. We then clean the data by checking for missing values and duplicates and by making a few changes to the column names to maintain consistency. Next, we explore the data with visuals, creating histograms, bar charts and pie charts to find patterns. Finally, we draw conclusions based on descriptive statistics and visualizations.
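The assessment steps described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the in-memory CSV and its three rows are made-up stand-ins for the real file, and only a few of the $14$ columns are included.

```python
import io

import pandas as pd

# Tiny stand-in for the real CSV file (hypothetical rows, subset of columns).
sample_csv = io.StringIO(
    "Gender,Age,Scholarship,No-show\n"
    "F,62,0,No\n"
    "M,8,0,Yes\n"
    "F,-1,1,No\n"
)
df = pd.read_csv(sample_csv)

print(df.shape)       # (number of rows, number of columns)
print(df.dtypes)      # data type of each column
df.info()             # concise summary, including non-null counts
print(df.describe())  # descriptive statistics for the numeric columns
```

With the real data, the same calls would be made on the dataframe returned by reading the downloaded CSV file.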

Data Wrangling

General Properties

The data set contains $110,527$ rows and $14$ columns.

The df.info() summary shows that there are no null values present in our data set.

Data Cleaning

Checking for duplicates and null values

The duplicate check shows that the data set contains no duplicate rows. We now check whether there are any missing values.

The above result confirms that there are no missing or null values in this data set.
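A minimal sketch of both checks, using a small made-up dataframe in place of the real one (the rows and the choice of columns are assumptions for illustration):

```python
import pandas as pd

# Hypothetical stand-in for the appointments dataframe.
df = pd.DataFrame({
    "Gender": ["F", "M", "F"],
    "Age": [62, 8, 35],
})

# Number of rows that are exact copies of an earlier row.
n_duplicates = df.duplicated().sum()

# Number of missing values in each column.
n_missing = df.isnull().sum()

print(n_duplicates)
print(n_missing)
```

A result of zero for both checks means no rows need to be dropped or imputed at this stage.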

From the summary statistics, we observe that the minimum value for age is negative. This is probably the result of an error that occurred during data collection, so we remove all rows with a negative age. We also drop the rows with age $0$, as we treat an age of zero as invalid in this analysis.

We have now dropped the negative value for age as well as all rows with age $0$, and the minimum age is no longer negative.
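One way to apply this filter is with a boolean mask, sketched here on made-up data (the four rows are hypothetical; Age and Gender follow the column names used in this project):

```python
import pandas as pd

# Hypothetical rows, including one negative age and one age of zero.
df = pd.DataFrame({"Age": [-1, 0, 8, 62], "Gender": ["F", "M", "M", "F"]})

# Keep only the rows with a strictly positive age.
df = df[df["Age"] > 0].reset_index(drop=True)

print(df["Age"].min())
```

A single condition, Age > 0, removes both the negative and the zero ages in one step.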

We now can also drop the columns which will not be of much help in our analysis.

The Handcap column contains the numerical values 0, 1, 2, 3 and 4. Because it is difficult to tell what these numbers signify, we drop this column.
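Dropping columns can be sketched as below. The dataframe is a made-up stand-in, and the exact spelling of the identifier columns (PatientId, AppointmentID) is an assumption about the raw file.

```python
import pandas as pd

# Hypothetical stand-in; column spellings are assumed, not confirmed here.
df = pd.DataFrame({
    "PatientId": [1.0, 2.0],
    "AppointmentID": [10, 11],
    "Handcap": [0, 2],
    "Age": [62, 8],
})

# Remove the columns that are not used in the analysis.
df = df.drop(columns=["PatientId", "AppointmentID", "Handcap"])

print(df.columns.tolist())
```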

To maintain consistency, we tidy up a few column names and correct the spelling of Hipertension to Hypertension. The following are the changes which we are going to make.

We now properly encode the data in the following columns: Scholarship, Hypertension, Diabetes, Alcoholism, SMS_received and No_show. We do this to make the information in these columns easier to analyse and interpret. The following are the changes which we are going to make.
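A sketch of the renaming, together with the conversion of the two day columns to date-time, is given below. The single row and its timestamp format are made up for illustration; the new names follow the ones used in the rest of this project.

```python
import pandas as pd

# Hypothetical one-row stand-in; the timestamp format is an assumption.
df = pd.DataFrame({
    "ScheduledDay": ["2016-04-29T18:38:08Z"],
    "AppointmentDay": ["2016-04-29T00:00:00Z"],
    "Hipertension": [1],
    "No-show": ["No"],
})

# Tidy the column names and fix the spelling of Hypertension.
df = df.rename(columns={
    "ScheduledDay": "Scheduled_day",
    "AppointmentDay": "Appointment_day",
    "Hipertension": "Hypertension",
    "No-show": "No_show",
})

# Convert the two day columns from strings to date-time values.
for col in ["Scheduled_day", "Appointment_day"]:
    df[col] = pd.to_datetime(df[col])

print(df.dtypes)
```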
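As an illustration only, one possible encoding replaces the 0/1 flags with readable labels. The mapping actually used in this project is not shown here, so the "No"/"Yes" labels and the sample rows below are assumptions.

```python
import pandas as pd

# Hypothetical stand-in with two of the flag columns.
df = pd.DataFrame({
    "Scholarship": [0, 1, 0],
    "SMS_received": [1, 0, 0],
    "No_show": ["No", "Yes", "No"],
})

# Assumed mapping from numeric flags to labels.
flag_labels = {0: "No", 1: "Yes"}
for col in ["Scholarship", "SMS_received"]:
    df[col] = df[col].map(flag_labels)

print(df)
```

The same loop would extend to Hypertension, Diabetes and Alcoholism if they are stored as 0/1 flags.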

Exploratory Data Analysis

Research Question 1 Is there any relationship between a patient's gender and no-show?

We now create a new data set consisting of patients who don't show up for appointments.
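A minimal sketch of this step on made-up data is shown below. It assumes No_show holds "Yes" for a missed appointment; the six rows are hypothetical. Alongside the share of each gender among the no-shows, it also computes the no-show rate within each gender, which is a related but different quantity.

```python
import pandas as pd

# Hypothetical stand-in; "Yes" in No_show is assumed to mean a missed appointment.
df = pd.DataFrame({
    "Gender": ["F", "F", "F", "M", "M", "F"],
    "No_show": ["Yes", "No", "Yes", "Yes", "No", "No"],
})

# Subset containing only the missed appointments.
no_show = df[df["No_show"] == "Yes"]

# Share of each gender among the missed appointments.
gender_share = no_show["Gender"].value_counts(normalize=True)

# No-show rate within each gender (missed / all appointments of that gender).
no_show_rate = (df["No_show"] == "Yes").groupby(df["Gender"]).mean()

print(gender_share)
print(no_show_rate)
```

The share among no-shows also reflects how many appointments each gender has in total, whereas the rate within each gender does not, so it can be useful to look at both.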

From the calculation of the proportions above, we can observe that females are more likely to miss appointments than males, by a very small margin.

From the bar chart above, we can conclude that females are more likely to miss appointments than males, though by a very small margin.

We also see that there is a very weak positive correlation between the gender of the patients and No_show.
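A correlation between two string columns can be computed by first converting each column to integer codes. The sketch below uses pd.factorize for this; the data are made up, and this is one possible approach rather than necessarily the one used in this project.

```python
import pandas as pd

# Hypothetical stand-in with two string columns.
df = pd.DataFrame({
    "Gender": ["F", "M", "F", "M", "F", "M"],
    "No_show": ["Yes", "No", "Yes", "No", "No", "No"],
})

# factorize assigns 0, 1, ... to the categories in order of first appearance.
gender_code = pd.Series(pd.factorize(df["Gender"])[0])
no_show_code = pd.Series(pd.factorize(df["No_show"])[0])

# Pearson correlation between the two sets of integer codes.
corr = gender_code.corr(no_show_code)

print(corr)
```

Because the codes depend on which category appears first, the sign of the coefficient depends on the coding; its magnitude is the more meaningful part.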

Research Question 2 Do patients on Scholarship show up for appointments more than those who are not on Scholarship?

From the bar chart above, we observe that patients without a scholarship are less likely to miss appointments than those with a scholarship.
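The proportions behind such a bar chart can be computed with a row-normalised cross-tabulation. The sketch below uses made-up data and assumes both columns hold "Yes"/"No" labels.

```python
import pandas as pd

# Hypothetical stand-in; labels are assumed to be "Yes"/"No".
df = pd.DataFrame({
    "Scholarship": ["No", "No", "No", "No", "Yes", "Yes"],
    "No_show": ["No", "No", "No", "Yes", "Yes", "No"],
})

# Each row sums to 1: the split between showing up and not, per scholarship group.
table = pd.crosstab(df["Scholarship"], df["No_show"], normalize="index")

# No-show rate for each scholarship group.
rate_by_scholarship = table["Yes"]

print(rate_by_scholarship)
```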

Research Question 3. Is there any relationship between a patient's age and no-show?

From the histogram above, we see that the distribution of age is strongly skewed to the right. This means that most no-show appointments involve children and adults, while elderly patients are more likely to show up for their appointments.
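The shape of the age distribution can also be checked numerically. The sketch below computes the skewness and counts no-shows per age band; the ages, the band edges and the band labels are all hypothetical choices for illustration.

```python
import pandas as pd

# Hypothetical ages of patients who missed their appointment.
no_show_ages = pd.Series([2, 5, 9, 17, 21, 24, 28, 33, 37, 45, 52, 70, 88])

# A positive value indicates a longer tail towards higher ages (right skew).
skewness = no_show_ages.skew()

# Assumed age bands: (0, 18], (18, 40], (40, 65], (65, 120].
bins = [0, 18, 40, 65, 120]
labels = ["child", "young adult", "middle-aged", "elderly"]
age_groups = pd.cut(no_show_ages, bins=bins, labels=labels)

# Number of missed appointments in each band, in band order.
counts = age_groups.value_counts().sort_index()

print(skewness)
print(counts)
```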

Research Question 4. Does getting an SMS reminder have any impact on show up rate?

From the pie chart above, we see that 44.1% of the patients who did not show up for their appointment had received an SMS notification, while the remaining 55.9% had not.

We now find the proportions of patients that did not show up for their appointment, split by whether or not they received an SMS to remind them about it.

From the bar chart above, we see that the majority of patients that received an SMS did not show up for their appointments.
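There are two different proportions one can compute here, and the sketch below shows both on made-up data (the eight rows and the "Yes"/"No" labels are assumptions). The first conditions on the no-shows, the second on whether an SMS was received.

```python
import pandas as pd

# Hypothetical stand-in; labels are assumed to be "Yes"/"No".
df = pd.DataFrame({
    "SMS_received": ["Yes", "Yes", "No", "No", "No", "No", "No", "Yes"],
    "No_show": ["Yes", "No", "Yes", "Yes", "No", "No", "No", "No"],
})

missed = df["No_show"] == "Yes"

# Among the missed appointments, the share with and without an SMS.
sms_share_among_no_shows = df.loc[missed, "SMS_received"].value_counts(normalize=True)

# Among patients with and without an SMS, the share who missed the appointment.
no_show_rate_by_sms = missed.groupby(df["SMS_received"]).mean()

print(sms_share_among_no_shows)
print(no_show_rate_by_sms)
```

The first result corresponds to what a pie chart of the no-show subset displays; the second is the quantity that speaks to whether a reminder affects the show up rate.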

Research Question 5 Which neighbourhoods have patients who don't show up for appointments?

To answer the above research question, we use a pie chart. We investigate the five neighbourhoods with the highest number of missed appointments.

These are the top five neighbourhoods by number of missed appointments.
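The top five can be obtained by counting the missed appointments per neighbourhood, as sketched below. The neighbourhood names A to F and their counts are hypothetical placeholders.

```python
import pandas as pd

# Hypothetical no-show subset with placeholder neighbourhood names.
no_show = pd.DataFrame({
    "Neighbourhood": ["A"] * 6 + ["B"] * 5 + ["C"] * 4 + ["D"] * 3 + ["E"] * 2 + ["F"],
})

# value_counts sorts in descending order, so head(5) gives the top five.
top_five = no_show["Neighbourhood"].value_counts().head(5)

print(top_five)
```

The resulting series can be passed directly to a plotting call to draw the pie chart.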

Research Question 6. Does the kind of condition or disease the patient has play a role in determining whether a patient will show up for an appointment?

From the pie chart above, we see that 17.4% of the patients who did not show up for their appointment had Hypertension, while 82.6% did not. The majority of patients who did not show up for appointments had no Hypertension.

From the pie chart above, we see that 6.6% of the patients who did not show up for their appointment had Diabetes, while 93.4% did not. The majority of patients who did not show up for appointments had no Diabetes.

From the pie chart above, we see that 3.1% of the patients who did not show up for their appointment had Alcoholism, while 96.9% did not. The majority of patients who did not show up for appointments had no Alcoholism problem.

We now group the No_show column by Hypertension, Alcoholism, and Diabetes, in order to check whether there is any relationship between No_show and the patient's medical condition.

From the above analysis, we observe that patients without Hypertension, Diabetes or Alcoholism are more likely to miss appointments.
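A minimal sketch of this grouping on made-up data follows. It assumes all four columns hold "Yes"/"No" labels and computes the no-show rate separately for each condition; the six rows are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in; labels are assumed to be "Yes"/"No".
df = pd.DataFrame({
    "Hypertension": ["Yes", "No", "No", "No", "Yes", "No"],
    "Diabetes": ["No", "No", "Yes", "No", "No", "No"],
    "Alcoholism": ["No", "No", "No", "No", "No", "Yes"],
    "No_show": ["No", "Yes", "No", "Yes", "No", "No"],
})

missed = df["No_show"] == "Yes"

# For each condition, the no-show rate of patients with and without it.
rates = {
    condition: missed.groupby(df[condition]).mean()
    for condition in ["Hypertension", "Diabetes", "Alcoholism"]
}

for condition, rate in rates.items():
    print(condition)
    print(rate)
```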

Conclusions

In this project, we investigated the No-show appointments data set of medical appointments in Brazil for the year 2016. Upon loading the data and checking for cleanliness, we observed that the data set contained no duplicates and no missing or null values. The main cleaning step was to drop some extraneous columns which were not relevant for our analysis, such as patientID, Handcap and AppointmentID. One limitation is that the Handcap column contained the numerical values 0, 1, 2, 3 and 4, which were difficult to interpret, and as a result this column had to be dropped. Another limitation is that the data set contained a negative value for age; this value and all rows with age $0$ were dropped, although there may be better ways of dealing with such cases. In addition, the data set contained more object data types than numeric ones.

In the data cleaning process, we also made a few changes to the column names by adding underscores to ScheduledDay and AppointmentDay and changing No-show to No_show. We also converted Scheduled_day and Appointment_day to the date-time format.

In the exploratory data analysis, we endeavoured to answer the questions that were posed in order to determine whether or not patients show up for appointments and to find out which factors are important for predicting whether a patient will show up.

From the exploratory data analysis, we observed that females are more likely to miss appointments than males, though by a small margin. We also observed that patients who have no scholarship are less likely to miss appointments than those who have one. We further observed that most no-show appointments involved children and adults as opposed to the elderly. Finally, we observed that the majority of patients that received an SMS did not show up for their appointments.

References

https://www.youtube.com/watch?v=O4538i9MQEc

https://stackoverflow.com/questions/51241575/calculate-correlation-between-columns-of-strings

https://www.tutorialsandyou.com/matplotlib/how-to-show-percentage-and-value-in-matplotlib-pie-chart-12.html

https://www.projectpro.io/recipes/convert-string-datetime-in-python

https://www.kaggle.com/joniarroba/noshowappointments

https://www.w3schools.com/python/python_functions.asp